Systems And Methods For Error Resilience In Video Communication Systems

ABSTRACT

Systems and methods for error resilient transmission and for random access in video communication systems are provided. The video communication systems are based on single-layer, scalable, or simulcast video coding with temporal scalability. A set of video frames or pictures in a video signal transmission is designated for reliable or guaranteed delivery to receivers using secure or high reliability links, or by retransmission techniques. The reliably-delivered video frames are used as reference pictures for resynchronization of receivers with the transmitted video signal after error incidence and for random access.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/971,769, filed Jan. 9, 2008, which claims the benefit of U.S. provisional patent application Ser. No. 60/884,148, filed Jan. 9, 2007. Further, this application is related to International patent application Nos. PCT/US06/028365, PCT/US06/028366, PCT/US06/061815, PCT/US06/62569, PCT/US07/80089, PCT/US07/062357, PCT/US07/065554, PCT/US07/065003, PCT/US06/028367, PCT/US07/063335, PCT/US07/081217, PCT/US07/080089, PCT/US07/083351, PCT/US07/086958, and PCT/US07/089076. All of the aforementioned applications, which are commonly assigned, are hereby incorporated by reference herein in their entireties.

FIELD OF THE INVENTION

The present invention relates to video data communication systems. In particular, the invention relates to techniques for providing error resilience in videoconferencing applications.

BACKGROUND OF THE INVENTION

Providing high quality digital video communications between senders and receivers over packet-based modern communication networks (e.g., a network based on the Internet Protocol (IP)) is technically challenging, at least due to the fact that data transport on such networks is typically carried out on a best-effort basis. Transmission errors in modern communication networks generally manifest themselves as packet losses and not as bit errors, which were characteristic of earlier communication systems. The packet losses often are the result of congestion in intermediary routers, and not the result of physical layer errors.

When a transmission error occurs in a digital video communication system, it is important to ensure that the receiver can quickly recover from the error and return to an error-free display of the incoming video signal. However, in typical digital video communication systems, the receiver's robustness is reduced by the fact that the incoming data is heavily compressed in order to conserve bandwidth. Further, the video compression techniques employed in the communication systems (e.g., state-of-the-art codecs ITU-T H.264 and H.263 or ISO MPEG-2 and MPEG-4 codecs) can create a strong temporal dependency between sequential video packets or frames. In particular, use of motion compensated prediction (e.g., involving the use of P or B frames) in codecs creates a chain of frame dependencies in which a displayed frame depends on past frame(s). The chain of dependencies can extend all the way to the beginning of the video sequence. As a result of the chain of dependencies, the loss of a given packet can affect the decoding of a number of the subsequent packets at the receiver. Error propagation due to the loss of the given packet terminates only at an “intra” (I) refresh point, or at a frame which does not use any temporal prediction at all.

Error resilience in digital video communication systems requires having at least some level of redundancy in the transmitted signals. However, this requirement is contrary to the goals of video compression techniques, which strive to eliminate or minimize redundancy in the transmitted signals.

On a network that offers differentiated services (e.g., DiffServ IP-based networks, private networks over leased lines, etc.), a video data communication application may exploit network features to deliver some or all of video signal data in a lossless or nearly lossless manner to a receiver. However, in an arbitrary best-effort network (such as the Internet) that has no provision for differentiated services, a data communication application has to rely on its own features for achieving error resilience. Known techniques (e.g., the Transmission Control Protocol, TCP) that are useful in text or alpha-numeric data communications are not appropriate for video or audio communications, which have the added constraint of low end-to-end delay arising out of human interface requirements. For example, TCP techniques may be used for error resilience in text or alpha-numeric data transport. TCP keeps on retransmitting data until confirmation that all data is received, even if it involves a delay of several seconds. However, TCP is inappropriate for video data transport in a live or interactive videoconferencing application because the end-to-end delay, which is unbounded, would be unacceptable to participants.

An aspect of error resilience in video communication systems relates to random access (e.g., when a receiver joins an existing transmission of a video signal), which has a considerable impact on compression efficiency. Instances of random access are, for example, a user who joins a videoconference, or a user who tunes in to a broadcast. Such a user would have to find a suitable point in the incoming bitstream signal to start decoding and be synchronized with the encoder. A random access point is effectively an error resilience feature, since any error propagation terminates at that point (i.e., it is an error recovery point). Thus, a particular coding scheme, which provides good random access support, will generally have an error resilience technique that provides for faster error recovery. However, the converse depends on the specific assumptions about the duration and extent of the errors that the error resilience technique is designed to address. The error resilience technique may assume that some state information is available at the receiver at the time an error occurs. In such case, the error resilience technique does not assure good random access support.

In MPEG-2 video codecs for digital television systems (digital cable TV or satellite TV), I pictures are used at periodic intervals (typically 0.5 sec) to enable fast switching into a stream. The I pictures, however, are considerably larger than their P or B counterparts (typically by 3-6 times) and are thus to be avoided, especially in low bandwidth and/or low delay applications.

In interactive applications such as videoconferencing, the concept of requesting an intra update is often used for error resilience. In operation, the update involves a request from the receiver to the sender for an intra picture transmission, which enables the decoder to be synchronized. The bandwidth overhead of this operation is significant. Additionally, this overhead is also incurred when packet errors occur. If the packet losses are caused by congestion, then the use of the intra pictures only exacerbates the congestion problem.

Another traditional technique for error robustness, which has been used in the past to mitigate drift caused by mismatch in IDCT implementations (e.g., in the H.261 standard), is to periodically code each macroblock in intra mode. The H.261 standard requires forced intra coding every 132 times a macroblock is transmitted.

The coding efficiency decreases with increasing percentage of macroblocks that are forced to be coded as intra in a given frame. Conversely, when this percentage is low, the time to recover from a packet loss increases. The forced intra coding process requires extra care to avoid motion-related drift, which further limits the encoder's performance since some motion vector values have to be avoided, even if they are the most effective.

In addition to traditional, single-layer codecs, layered or scalable coding is a well-known technique in multimedia data encoding. Scalable coding is used to generate two or more “scaled” bitstreams collectively representing a given medium in a bandwidth-efficient manner. Scalability can be provided in a number of different dimensions, namely temporally, spatially, and quality (also referred to as SNR “Signal-to-Noise Ratio” scalability). For example, a video signal may be scalably coded in different layers at CIF and QCIF resolutions, and at frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the codec's structure, any combination of spatial resolutions and frame rates may be obtainable from the codec bitstream. The bits corresponding to the different layers can be transmitted as separate bitstreams (i.e., one stream per layer) or they can be multiplexed together in one or more bitstreams. For convenience in description herein, the coded bits corresponding to a given layer may be referred to as that layer's bitstream, even if the various layers are multiplexed and transmitted in a single bitstream. Codecs specifically designed to offer scalability features include, for example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the currently developed H.264 Scalable Video Coding extension (known as ITU-T H.264 Annex G or MPEG-4 Part 10 SVC). Scalable video coding (SVC) techniques specifically designed for video communication are described in commonly assigned international patent application No. PCT/US06/028365 “SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING”.

It is noted that even codecs that are not specifically designed to be scalable can exhibit scalability characteristics in the temporal dimension. For example, consider an MPEG-2 Main Profile codec, a non-scalable codec, which is used in DVDs and digital TV environments. Further, assume that the codec is operated at 30 fps and that a GOP structure of IBBPBBPBBPBBPBB (period N=15 frames) is used. By sequential elimination of the B pictures, followed by elimination of the P pictures, it is possible to derive a total of three temporal resolutions: 30 fps (all picture types included), 10 fps (I and P only), and 2 fps (I only). The sequential elimination process results in a decodable bitstream because the MPEG-2 Main Profile codec is designed so that coding of the P pictures does not rely on the B pictures, and similarly coding of the I pictures does not rely on other P or B pictures. In the following, single-layer codecs with temporal scalability features are considered to be a special case of scalable video coding, and are thus included in the term scalable video coding, unless explicitly indicated otherwise.
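The frame-rate arithmetic of the GOP example above can be checked with a minimal Python sketch. The function name `temporal_layers` and the step labels are hypothetical and serve only this illustration; the GOP pattern and the 30 fps rate are those given in the text.

```python
from fractions import Fraction


def temporal_layers(gop: str, full_fps: int) -> dict:
    """Frame rate that remains after each elimination step for one GOP pattern.

    `gop` is the picture-type pattern of one GOP period, e.g. "IBBPBB...".
    """
    steps = {
        "I+P+B": set("IPB"),  # nothing eliminated
        "I+P": set("IP"),     # B pictures eliminated
        "I": set("I"),        # B and P pictures eliminated
    }
    rates = {}
    for name, kept in steps.items():
        kept_count = sum(1 for picture_type in gop if picture_type in kept)
        rates[name] = Fraction(full_fps * kept_count, len(gop))
    return rates


rates = temporal_layers("IBBPBBPBBPBBPBB", 30)
for name, fps in rates.items():
    print(f"{name}: {float(fps)} fps")
```

With one I picture, four P pictures, and ten B pictures per 15-frame period, the sketch yields 30, 10, and 2 fps for the three steps, matching the values stated above.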

Scalable codecs typically have a pyramidal bitstream structure in which one of the constituent bitstreams (called the “base layer”) is essential in recovering the original medium at some basic quality. Use of one or more of the remaining bitstreams (called “the enhancement layer(s)”) along with the base layer increases the quality of the recovered medium. Data losses in the enhancement layers may be tolerable, but data losses in the base layer can cause significant distortions or complete loss of the recovered medium.

Scalable codecs pose challenges similar to those posed by single layer codecs for error resilience and random access. However, the coding structures of the scalable codecs have unique characteristics that are not present in single layer video codecs. Further, unlike single layer coding, scalable coding may involve switching from one scalability layer to another (e.g., switching back and forth between CIF and QCIF resolutions).

Simulcasting is a coding solution for videoconferencing that is less complex than scalable video coding but has some of the advantages of the latter. In simulcasting, two different versions of the source are encoded (e.g., at two different spatial resolutions) and transmitted. Each version is independent, in that its decoding does not depend on reception of the other version. Simulcasting poses random access and robustness issues similar to those of scalable and single-layer coding. In the following, simulcasting is considered a special case of scalable coding (where no inter layer prediction is performed) and both are referred to simply as scalable video coding techniques unless explicitly indicated otherwise.

Specific techniques for providing error resilience and random access in video communication systems are described in commonly assigned International patent application Nos. PCT/US06/061815, “SYSTEMS AND METHODS FOR ERROR RESILIENCE AND RANDOM ACCESS IN VIDEO COMMUNICATIONS SYSTEMS,” and PCT/US07/063335, “SYSTEM AND METHOD FOR PROVIDING ERROR RESILIENCE, RANDOM ACCESS, AND RATE CONTROL IN SCALABLE VIDEO COMMUNICATIONS.” Among other things, these patent applications disclose the concept of LR pictures, i.e., pictures that constitute the lowest temporal layer of a scalably coded video signal (at the lowest spatial or quality resolution) and which are transmitted reliably from a sender to a receiver. Reliable transmission of the LR pictures ensures a minimum level of quality at a receiving decoder. A receiver can immediately detect if an LR picture has been lost and take steps to obtain the lost picture (e.g., by requesting its retransmission from the sender) using, for example, a “key picture indices” mechanism, which is also disclosed in International patent application No. PCT/US06/061815. It is noted that the sender and receiver are not necessarily the encoder and decoder, respectively, but may be a Scalable Video Communication Server (SVCS) as disclosed in commonly assigned International patent application No. PCT/US06/028366, a Compositing SVCS (CSVCS) as disclosed in commonly assigned International patent application No. PCT/US06/62569, or a Multicast SVCS (MSVCS) as disclosed in commonly assigned International patent application No. PCT/US07/80089.

A potential limitation of the systems and methods described in International patent application No. PCT/US06/061815 occurs when the lowest temporal level pictures are transported over more than one packet. This may occur, for example, in coding high-definition video, where each frame may be transported using more than one transport-layer packet, or when a picture is coded using more than one slice and each slice is transported in its own packet. In these cases, all packets belonging to the same frame will have the same key picture index. If all slices are lost due to packet losses in the network, then a receiver can properly detect the loss of the entire picture and initiate corrective action. If, however, some or all of the slices are received, then a receiver cannot immediately infer whether the received slices contain the entire picture or only part of it, unless it proceeds to decode the slice data. This inference is straightforward in a receiver that decodes the received data, but it presents significant complexity for an intermediate receiver (e.g., an SVCS, CSVCS, or MSVCS, or any Media-Aware Network Element (MANE)) that is normally not equipped to perform decoding of the video data.
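The ambiguity described above can be illustrated with a minimal Python sketch. It is not the mechanism of PCT/US06/061815 itself: the function name `missing_key_indices`, the 8-bit wrap-around (`modulo=256`), and the assumption that packets arrive in order are all hypothetical choices made for this illustration.

```python
def missing_key_indices(received_indices, modulo=256):
    """Key picture indices skipped between consecutively received packets.

    Assumes in-order arrival and an index that wraps at `modulo`
    (both are assumptions of this sketch, not facts from the source).
    """
    missing = []
    prev = None
    for idx in received_indices:
        if prev is not None and idx != prev:
            gap = (idx - prev) % modulo
            # Every index strictly between prev and idx was never seen.
            missing.extend((prev + k) % modulo for k in range(1, gap))
        prev = idx
    return missing


# Picture 7 is carried in three packets; each packet carries key index 7.
whole_picture_lost = [6, 8]        # all three packets of picture 7 lost
one_slice_lost = [6, 7, 7, 8]      # one of the three packets lost
nothing_lost = [6, 7, 7, 7, 8]     # all packets received

print(missing_key_indices(whole_picture_lost))
print(missing_key_indices(one_slice_lost))
print(missing_key_indices(nothing_lost))
```

Only the first case produces a gap in the index sequence. The second and third cases look identical to a receiver that inspects the key picture index alone, which is the limitation the paragraph describes.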

Consideration is now being given to improving error resilience to the coded bitstreams in video communications systems. Attention is directed towards developing error resilience techniques which have a minimal impact on end-to-end delay and the bandwidth used by the system, and address the possibility of fragmentation of coded video data in multiple slices. Desirable error resilience techniques will be applicable to both scalable and single-layer video coding.

SUMMARY OF THE INVENTION

The present invention provides systems and methods to increase error resilience in video communication systems based on single-layer as well as scalable video coding. Specifically, the present invention provides a mechanism for a receiver to detect if portions of a picture that is intended to be transmitted reliably have been lost due to packet losses, so that corrective action can be initiated with minimal delay. Specific techniques are provided for transmission over RTP as well as when using H.264 Annex G (SVC) NAL units.
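One way a receiver might detect partial loss from packet headers alone is sketched below in Python. This is an illustrative sketch under stated assumptions, not the claimed method: the `Packet` fields, the function `picture_complete`, and the assumption that the packets of one picture occupy consecutive 16-bit sequence numbers are all hypothetical. The use of a frame index together with start and end flags follows the description of FIG. 19.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    seq: int          # transport sequence number (assumed 16-bit, as in RTP)
    frame_index: int  # index of the reliably transmitted picture
    start: bool       # set on the first packet of the picture
    end: bool         # set on the last packet of the picture


def picture_complete(packets, frame_index, seq_modulo=1 << 16):
    """True if all packets of the picture arrived, judged from headers only.

    Assumes the picture's packets use consecutive sequence numbers.
    """
    own = [p for p in packets if p.frame_index == frame_index]
    start_seqs = {p.seq for p in own if p.start}
    end_seqs = {p.seq for p in own if p.end}
    if len(start_seqs) != 1 or len(end_seqs) != 1:
        return False  # first or last packet missing (or headers inconsistent)
    first, last = next(iter(start_seqs)), next(iter(end_seqs))
    expected = (last - first) % seq_modulo + 1
    seen = {(p.seq - first) % seq_modulo for p in own}
    return seen == set(range(expected))


full = [
    Packet(100, 7, True, False),
    Packet(101, 7, False, False),
    Packet(102, 7, False, True),
]
middle_lost = [full[0], full[2]]
end_lost = [full[0], full[1]]

print(picture_complete(full, 7), picture_complete(middle_lost, 7),
      picture_complete(end_lost, 7))
```

Because the check reads only header fields, an intermediate element that does not decode video data could apply it and request retransmission as soon as a picture is found incomplete.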

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an exemplary video conferencing system for delivering scalably coded video data, in accordance with the principles of the present invention;

FIG. 2 is a block diagram illustrating an exemplary end-user terminal compatible with the use of single layer video coding, in accordance with the principles of the present invention;

FIG. 3 is a block diagram illustrating an exemplary end-user terminal compatible with the use of scalable or simulcast coding, in accordance with the principles of the present invention;

FIG. 4 is a block diagram illustrating the internal switching structure of a multipoint SVCS, in accordance with the principles of the present invention;

FIG. 5 is a block diagram illustrating the principles of operation of an SVCS;

FIG. 6 is a block diagram illustrating the structure of an exemplary video encoder, in accordance with the principles of the present invention;

FIG. 7 is a block diagram illustrating an exemplary architecture of a video encoder for encoding base and temporal enhancement layers, in accordance with the principles of the present invention;

FIG. 8 is a block diagram illustrating an exemplary architecture of a video encoder for a spatial enhancement layer, in accordance with the principles of the present invention;

FIG. 9 is a block diagram illustrating an exemplary layered picture coding structure, in accordance with the principles of the present invention;

FIG. 10 is a block diagram illustrating another exemplary layered picture coding structure, in accordance with the principles of the present invention;

FIG. 11 is a block diagram illustrating an exemplary picture coding structure including temporal and spatial scalability, in accordance with the principles of the present invention;

FIG. 12 is a block diagram illustrating an exemplary layered picture coding structure used for error resilient video communications, in accordance with the principles of the present invention;

FIG. 13 is a block diagram illustrating an exemplary picture coding structure used for error resilient video communications with spatial/quality scalability, in accordance with the principles of the present invention.

FIG. 14 is a block diagram illustrating an exemplary architecture of the transmitting terminal's LRP (Snd) module when the R-packets technique is used for transmission over RTP, in accordance with the principles of the present invention.

FIG. 15 is a block diagram illustrating an exemplary architecture of the receiving terminal's LRP (Rcv) module when the R-packets technique is used for transmission over RTP, in accordance with the principles of the present invention.

FIG. 16 is a block diagram illustrating an exemplary architecture of the server's LRP Snd and Rcv modules when the R-packets technique is used for transmission over RTP, in accordance with the principles of the present invention.

FIG. 17 illustrates an exemplary structure for the named RTP header extension for RTP packets, in accordance with the principles of the present invention.

FIG. 18 illustrates an exemplary structure for the feedback control information field of RNACK packets, in accordance with the principles of the present invention.

FIG. 19 illustrates a modified H.264 Annex G (SVC) NAL header extension syntax with frame indices and start/end flags, in accordance with the principles of the present invention.

Throughout the figures the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the present invention will now be described in detail with reference to the Figures, it is done so in connection with the illustrative embodiments.

DETAILED DESCRIPTION OF THE INVENTION

The present invention provides systems and methods for error resilient transmission in video communication systems. The mechanisms are compatible with scalable video coding techniques as well as single-layer and simulcast video coding with temporal scalability, which may be used in video communication systems.

The systems and methods involve designating a set of video frames or pictures in a video signal transmission for reliable or guaranteed delivery to receivers. Reliable delivery of the designated set of video frames may be accomplished by using secure or high reliability links, or by retransmission techniques. The reliably-delivered video frames are used as reference pictures for resynchronization of receivers with the transmitted video signal after error incidence or for random access.

In a preferred embodiment, an exemplary video communication system may be a multi-point videoconferencing system 10 operated over a packet-based network. (See e.g., FIG. 1). Multi-point videoconferencing system 10 may include optional bridges 120a and 120b (e.g., Multipoint Control Unit (MCU) or Scalable Video Communication Server (SVCS)) to mediate scalable multilayer or single layer video communications between endpoints (e.g., users 1-k and 1-m) over the network. The operation of the exemplary video communication system is the same, and equally advantageous, for a point-to-point connection with or without the use of optional bridges 120a and 120b.

A detailed description of scalable video coding techniques and videoconferencing systems based on scalable video coding is provided in commonly assigned International patent application No. PCT/US06/028365 “SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING”, No. PCT/US06/028366 “SYSTEM AND METHOD FOR A CONFERENCE SERVER ARCHITECTURE FOR LOW DELAY AND DISTRIBUTED CONFERENCING APPLICATIONS”, No. PCT/US06/062569 “SYSTEM AND METHOD FOR VIDEOCONFERENCING USING SCALABLE VIDEO CODING AND COMPOSITING SCALABLE VIDEO SERVERS”, and No. PCT/US07/80089 “SYSTEM AND METHOD FOR MULTIPOINT CONFERENCING WITH SCALABLE VIDEO CODING SERVERS AND MULTICAST”. Further, descriptions of error resilience, random access, and rate control techniques are provided in commonly assigned International patent applications No. PCT/US06/061815 “SYSTEMS AND METHODS FOR ERROR RESILIENCE AND RANDOM ACCESS IN VIDEO COMMUNICATION SYSTEMS” and No. PCT/US07/063335 “SYSTEM AND METHOD FOR PROVIDING ERROR RESILIENCE, RANDOM ACCESS, AND RATE CONTROL IN SCALABLE VIDEO COMMUNICATIONS”. All of the aforementioned International patent applications are incorporated by reference herein in their entireties. The systems and methods of the present invention improve upon the systems and methods described in International patent application No. PCT/US06/061815.

FIG. 1 shows the general structure of a videoconferencing system 10. Videoconferencing system 10 includes a plurality of end-user terminals (e.g., users 1-k and users 1-m) that are linked over a network 100 via LANs 1 and 2 and servers 120a and 120b. The servers may be traditional MCUs, Scalable Video Communication Servers (SVCS), Compositing Scalable Video Communication Servers (CSVCS), or Multicast Scalable Video Communication Servers (MSVCS). The latter servers have the same purpose as traditional MCUs, but with significantly reduced complexity and improved functionality. (See e.g., International patent application No. PCT/US06/28366). In the description herein, the term “server” may be used generically to refer to any of the SVCS types.

FIG. 2 shows the architecture of an end-user terminal 140, which is designed for use with videoconferencing systems (e.g., system 10) based on single layer coding. Similarly, FIG. 3 shows the architecture of an end-user terminal 140, which is designed for use with videoconferencing systems (e.g., system 10) based on multi-layer coding. Terminal 140 includes human interface input/output devices (e.g., a camera 210A, a microphone 210B, a video display 250C, a speaker 250D), and one or more network interface controller cards (NICs) 230 coupled to input and output signal multiplexer and demultiplexer units (e.g., packet MUX 220A and packet DMUX 220B). NIC 230 may be a standard hardware component, such as an Ethernet LAN adapter, or any other suitable network interface device, or a combination thereof.

Camera 210A and microphone 210B are designed to capture participant video and audio signals, respectively, for transmission to other conferencing participants. Conversely, video display 250C and speaker 250D are designed to display and play back video and audio signals received from other participants, respectively. Video display 250C may also be configured to optionally display participant/terminal 140's own video. Camera 210A and microphone 210B outputs are coupled to video and audio encoders 210G and 210H via analog-to-digital converters 210E and 210F, respectively. Video and audio encoders 210G and 210H are designed to compress input video and audio digital signals in order to reduce the bandwidths necessary for transmission of the signals over the electronic communications network. The input video signal may be live, or pre-recorded and stored video signals. The encoders compress the local digital signals in order to minimize the bandwidth necessary for transmission of the signals.

In an exemplary embodiment of the present invention, the audio signal may be encoded using any suitable technique known in the art (e.g., G.711, G.729, G.729EV, MPEG-1, etc.). In a preferred embodiment of the present invention, the scalable audio codec G.729EV is employed by audio encoder 210H to encode audio signals. The output of audio encoder 210H is sent to multiplexer MUX 220A for transmission over network 100 via NIC 230.

Packet MUX 220A may perform traditional multiplexing using the RTP protocol. Packet MUX 220A may also perform any related Quality of Service (QoS) processing that may be offered by network 100. Each stream of data from terminal 140 is transmitted in its own virtual channel or “port number” in IP terminology.

FIG. 3 shows the end-user terminal 140, which is configured for use with videoconferencing systems in which scalable or simulcast video coding is used. In this case, video encoder 210G has multiple outputs. FIG. 3 shows, for example, two layer outputs, which are labeled as “base” and “enhancement”. The outputs of terminal 140 (i.e., the single layer output (FIG. 2) or the multiple layer outputs (FIG. 3)) are connected to Packet MUX 220A via an LRP processing module 270A. LRP processing module 270A (and modules 270B) are designed for error resilient communications (“error resilience LRP operation”) by processing transmissions of special types of frames (e.g., “R” frames, FIGS. 12 and 13) as well as any other information that requires reliable transmission such as video sequence header data. If video encoder 210G produces more than one enhancement layer output, then each may be connected to LRP processing module 270A in the same manner as shown in FIG. 3. Similarly, in this case, the additional enhancement layers will be provided to video decoders 230A via LRP processing modules 270B. Alternatively, one or more of the enhancement layer outputs may be directly connected to Packet MUX 220A, and not via LRP processing module 270A.

Terminal 140 also may be configured with a set of video and audio decoder pairs 230A and 230B, with one pair for each participant that is seen or heard at terminal 140 in a videoconference. It will be understood that although several instances of decoders 230A and 230B are shown in FIGS. 2 and 3, it is possible to use a single pair of decoders 230A and 230B to sequentially process signals from multiple participants. Thus, terminal 140 may be configured with a single pair, or with fewer pairs of decoders 230A and 230B than there are participants.

The outputs of audio decoders 230B are connected to an audio mixer 240, which in turn is connected with a digital-to-analog converter (DA/C) 250A, which drives speaker 250D. The audio mixer combines the individual signals into a single output signal for playback. If the audio signals arrive pre-mixed, then audio mixer 240 may not be required. Similarly, the outputs of video decoders 230A may be combined in the frame buffer 250B of video display 250C via compositor 260. Compositor 260 is designed to position each decoded picture at an appropriate area of the output picture display. For example, if the display is split into four smaller areas, then compositor 260 obtains pixel data from each of video decoders 230A and places it in the appropriate frame buffer position (e.g., by filling up the lower right picture). To avoid double buffering (e.g., once at the output of decoder 230A and once at frame buffer 250B), compositor 260 may be implemented as an address generator that drives the placement of the output pixels of decoder 230A. Other techniques for optimizing the placement of the individual video outputs to display 250C can also be used to similar effect.
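The address-generator idea can be sketched in a few lines of Python for the four-area example. The function names, the row-major slot numbering, and the linear one-value-per-pixel buffer layout are assumptions made for this illustration and are not taken from the figures.

```python
def quadrant_origin(slot, out_width, out_height):
    """Top-left (x, y) of one tile in a 2x2 layout; slots 0..3 are row-major."""
    if not 0 <= slot < 4:
        raise ValueError("slot must be 0, 1, 2, or 3")
    tile_w, tile_h = out_width // 2, out_height // 2
    return (slot % 2) * tile_w, (slot // 2) * tile_h


def buffer_address(slot, x, y, out_width, out_height):
    """Linear frame-buffer index for pixel (x, y) of the picture shown in `slot`.

    A decoder writing through this mapping places its output directly in the
    display buffer, so no intermediate per-decoder copy is needed.
    """
    origin_x, origin_y = quadrant_origin(slot, out_width, out_height)
    return (origin_y + y) * out_width + (origin_x + x)


# Four CIF pictures (352x288) tiled into a 704x576 output picture.
print(quadrant_origin(3, 704, 576))       # lower-right tile origin
print(buffer_address(3, 0, 0, 704, 576))  # its first pixel in the buffer
```

For the lower-right tile the origin is (352, 288), so its first pixel lands at index 288 * 704 + 352 = 203104 in the output buffer.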

For example, in the H.264 standard specification, it is possible to combine views of multiple participants in a single coded picture by using a flexible macroblock ordering (FMO) scheme. In this scheme, each participant occupies a portion of the coded image, comprising one of its slices. Conceptually, a single decoder can be used to decode all participant signals. However, from a practical view, the receiver/terminal will have to decode four smaller independently coded slices. Thus, terminal 140 shown in FIG. 2 and FIG. 3 with decoders 230A may be used in applications of the H.264 specification. It is noted that the server for forwarding slices is a CSVCS.

In terminal 140, demultiplexer DMUX 220B receives packets from NIC 230 and redirects them to the appropriate decoder unit 230A via receiving LRP modules 270B as shown in FIGS. 2 and 3. LRP modules 270B at the inputs of video decoders 230A terminate the error resilience LRP at the receiving terminal end.

The MCU or SERVER CONTROL block 280 coordinates the interaction between the server (SVCS/CSVCS) and the end-user terminals. In a point-to-point communication system without intermediate servers, the SERVER CONTROL block is not needed. Similarly, in non-conferencing applications, only a single decoder is needed at a receiving end-user terminal. For applications involving stored video (e.g., broadcast of pre-recorded, pre-coded material), the transmitting end-user terminal may not involve the entire functionality of the audio and video encoding blocks or of all the terminal blocks preceding them (e.g., camera, microphone, etc.). Specifically, only the portions related to selective transmission of video packets, as explained below, need to be provided.

It will be understood that the various components of terminal 140 may be physically separate software and hardware devices or units that are interconnected to each other (e.g., integrated in a personal computer), or may be any combination thereof.

FIG. 4 shows the structure of an exemplary SVCS 400 for use in error resilient processing applications. The core of the SVCS 400 is a switch 410 that determines which packet from each of the possible sources is transmitted to which destination and over what channel. (See e.g., PCT/US06/028366).

The principles of operation of an exemplary SVCS 400 can be understood with reference to FIG. 5. An SVC Encoder 510 at a transmitting terminal or endpoint in this example produces three spatial layers in addition to a number of temporal layers (not shown pictorially). The individual coded video layers are transmitted from the transmitting endpoint (SVC Encoder) to SVCS 400 in individual packets. SVCS 400 decides which packets to forward to each of the three recipient/decoders 520 shown, depending on network conditions or user preferences. In the example shown in FIG. 5, SVCS 400 forwards only the first and second spatial layers to SVC Decoder 520(0), all three spatial layers to SVC Decoder 520(1), and only the first (base) layer to SVC Decoder 520(2).
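The forwarding decision of FIG. 5 can be sketched in Python as a per-receiver filter on a layer identifier carried by each packet. The dictionary-based packet representation, the `spatial_layer` field, and the receiver names are hypothetical; how the highest layer for each receiver is chosen (network conditions, user preferences) is outside this sketch.

```python
def forward(packets, max_layer_per_receiver):
    """Select, for each receiver, the packets at or below its highest layer.

    The server only reads the layer identifier; it never decodes the payload.
    """
    return {
        receiver: [p for p in packets if p["spatial_layer"] <= max_layer]
        for receiver, max_layer in max_layer_per_receiver.items()
    }


# One packet per spatial layer; layer 0 is the base layer.
packets = [{"spatial_layer": layer, "payload": f"L{layer}"} for layer in (0, 1, 2)]

# Layer choices matching the example of FIG. 5.
policy = {"decoder_520_0": 1, "decoder_520_1": 2, "decoder_520_2": 0}

routed = forward(packets, policy)
for receiver, selected in routed.items():
    print(receiver, [p["payload"] for p in selected])
```

With this policy the first receiver gets the first and second spatial layers, the second gets all three, and the third gets only the base layer.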

With renewed reference to FIG. 4, in addition to the switch, which is described in PCT/US06/028366, SVCS 400 includes LRP units 470A and 470B, which are disposed at the switch inputs and outputs, respectively. SVCS 400 is configured to terminate error resilience LRP processing at its incoming switch connection, and to initiate error resilience LRP processing at its outgoing switch connections. In implementations of the invention using SVCS 400, error resilience LRP processing is not performed end-to-end over the network, but only over each individual connection segment (e.g., sender-to-SVCS, SVCS-to-SVCS, and SVCS-to-recipient). It will, however, be understood that the inventive error resilience LRP processing may be executed in an end-to-end fashion over the network, with or without the use of an SVCS. An SVCS 400 without LRP units 470A and 470B can be used for end-to-end LRP processing in networks in which an SVCS is used. Further, SVCS 400 may be equipped with more than one NIC 230, as would typically be the case if SVCS 400 connects users across different networks.

FIG. 6 shows the architecture of an exemplary video encoder 600 that may be used in error resilient video communication systems. Video encoder 600 may, for example, be a motion-compensated, block-based transform coder. An H.264 design is a preferred design for video encoder 600. However, other codec designs may be used. For example, FIG. 7 shows the architecture of an exemplary video encoder 600′ for encoding base and temporal enhancement layers based on SVC designs, whereas FIG. 8 shows the architecture of an exemplary video encoder 600″ for encoding spatial enhancement layers. (See e.g., PCT/US06/28365 and PCT/US06/028366). Video encoders 600′ and 600″ include an optional input downsampler 640, which can be utilized to reduce the input resolution (e.g., from CIF to QCIF) in systems using spatial scalability.

FIG. 6 also shows a coding process, which may be implemented using video encoder 600. ENC REF CONTROL 620 in encoder 600 is used to create a “threaded” coding structure. (See e.g., PCT/US06/28365 and PCT/US06/028366). Standard block-based motion compensated codecs have a regular structure of I, P, and B frames. For example, in a picture sequence (in display order) such as IBBPBBP, the ‘P’ frames are predicted from the previous P or I frame, whereas the B pictures are predicted using both the previous and next P or I frame. Although the number of B pictures between successive I or P pictures can vary, as can the rate at which I pictures appear, it is not possible, for example, for a P picture to use as a reference for prediction another P picture that is earlier in time than the most recent one. H.264 is an exception in that the encoder and decoder maintain two reference picture lists. It is possible to select which pictures are used for references and also which references are used for a particular picture that is to be coded. The FRAME BUFFERS block 610 in FIG. 6 represents the memory that stores the reference picture list(s), whereas ENC REF CONTROL 620 determines, at the encoder side, which reference picture is to be used for the current picture.

The operation of ENC REF CONTROL 620 can be better understood with reference to FIG. 9, which shows an exemplary layered picture coding structure 900. In order to enable multiple temporal resolutions, the codec used in the video communications system may generate a number of separate picture “threads.” A thread at a given level is defined as a sequence of pictures that are motion compensated using pictures either from the same thread, or pictures from a lower level thread. The use of threads allows the implementation of temporal scalability, since one can eliminate any number of top-level threads without affecting the decoding process of the remaining threads.

In a preferred embodiment of the present invention, a coding structure with a set of three threads is used (e.g., structure 900, FIG. 9). In FIG. 9, the letter ‘L’ in the picture labels indicates an arbitrary scalability layer. The numbers (0, 1 and 2) following L identify the temporal layer, for example, with “0” corresponding to the lowest, or coarsest temporal layer and “2” corresponding to the highest or finest temporal layer. The arrows shown in FIG. 9 indicate the direction, source, and target of prediction. In most applications only P pictures will be used, as the use of B pictures increases the coding delay by the time it takes to capture and encode the reference pictures used for the B pictures. However, in applications that are not delay sensitive, some or all of the pictures could be B pictures with the possible exception of L0 pictures. Similarly, the L0 pictures may be I pictures forming a traditional group of pictures (GOP).

With continued reference to FIG. 9, layer L0 is simply a series of regular P pictures spaced four pictures apart. Layer L1 has the same frame rate as L0, but prediction is only allowed from the previous L0 frame. Layer L2 frames are predicted from the most recent L0 or L1 frame. L0 provides one fourth (1:4) of the full temporal resolution, L1 doubles the L0 frame rate (1:2), and L2 doubles the L0+L1 frame rate (1:1).
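The layer assignment and the resulting frame rates can be illustrated with a short sketch. It assumes, for illustration only, that pictures are indexed from 0 in display order, that every fourth picture is an L0 picture, and that L1 pictures fall midway between L0 pictures; the function names are hypothetical.

```python
def temporal_layer(n):
    """Temporal layer (0, 1 or 2) of picture n, assuming a 4-picture period."""
    if n % 4 == 0:
        return 0          # L0: every fourth picture
    if n % 4 == 2:
        return 1          # L1: midway between successive L0 pictures
    return 2              # L2: all remaining pictures

def keep_up_to(pictures, max_layer):
    """Drop every thread above max_layer (temporal scalability)."""
    return [n for n in pictures if temporal_layer(n) <= max_layer]
```

Under these assumptions, keeping only L0 retains one picture in four, keeping L0 and L1 retains one in two, and keeping all three layers retains every picture, which matches the 1:4, 1:2 and 1:1 ratios given above.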

More or fewer layers than the three L0, L1 and L2 layers discussed above may be similarly constructed in coding structures designed to accommodate the different bandwidth/scalability requirements of specific implementations of the present invention. FIG. 10 shows an example of a threaded coding structure 1000 with only two layers L0 and L1. Further, FIG. 11 shows an example of a threaded coding structure 1100 for spatial scalability. Coding structure 1100 includes threads for enhancement layers, which are denoted by the letter ‘S’. It will be noted that the enhancement layer frames may have a different threading structure than the base layer frames.

Video encoder 600′ (FIG. 7) for encoding temporal layers may be augmented to encode spatial and/or quality enhancement layers. (See e.g., PCT/US06/028365 and PCT/US06/028366). FIG. 8 shows an exemplary encoder 600″ for the spatial enhancement layer. The structure and functions of encoder 600″ are similar to those of the base layer codec 600′, except in that base layer information is also available to the encoder 600″. This information may consist of motion vector data, macroblock mode data, coded prediction error data, or reconstructed pixel data. Encoder 600″ can re-use some or all of this data in order to make coding decisions for the enhancement layers S. The data has to be scaled to the target resolution of the enhancement layer (e.g., by a factor of 2 if the base layer is QCIF and the enhancement layer is CIF). Although spatial scalability typically requires two coding loops to be maintained, it is possible, for example, under the H.264 Annex G (SVC) draft standard, to perform single-loop decoding by limiting the data of the base layer that is used for enhancement layer coding to only values that are computable from the information encoded in the current picture's base layer. (See e.g., T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, M. Wien, eds., “Joint Draft 8 of SVC Amendment,” Joint Video Team, Doc. JVT-U201, Hangzhou, October 2006, incorporated herein by reference in its entirety). For example, if a base layer macroblock is inter-coded, then the enhancement layer cannot use the reconstructed pixels of that macroblock as a basis for prediction. It can, however, use its motion vectors and the prediction error values since they are obtainable by just decoding the information contained in the current base layer picture. Single-loop decoding is desirable since the complexity of the decoder is significantly decreased.

Quality or SNR scalability enhancement layer codecs may be constructed in the same manner as spatial scalability codecs. For quality scalability, instead of building the enhancement layer on a higher resolution version of the input, the codecs code the residual prediction error at the same spatial resolution. As with spatial scalability, all the macroblock data of the base layer can be re-used at the enhancement layer, in either single- or dual-loop coding configurations. For brevity, the description herein is generally directed to techniques using spatial scalability. It will, however, be understood that the same techniques are applicable to quality scalability.

International patent application PCT/US06/028365 describes the distinct advantages that threading coding structures (e.g., coding structure 900) have in terms of their robustness to the presence of transmission errors. In traditional state-of-the-art video codecs based on motion-compensated prediction, temporal dependency is inherent. Any packet loss at a given picture not only affects the quality of that particular picture, but also affects all future pictures for which the given picture acts as a reference, either directly or indirectly. This is because the reference frame that the decoder can construct for future predictions will not be the same as the one used at the encoder. The ensuing difference, or drift, can have tremendous impact on the visual quality produced by traditional state-of-the-art video codecs.

In contrast, the threading structure shown in FIG. 9 creates three self-contained threads or chains of dependencies. A packet loss occurring for an L2 picture will only affect L2 pictures; the L0 and L1 pictures can still be decoded and displayed. Similarly, a packet loss occurring at an L1 picture will only affect L1 and L2 pictures; the L0 pictures can still be decoded and displayed. Further, threading structures may be created to include threads or chains of dependencies for S pictures (e.g., FIG. 11). The exemplary S packets threading structure 1100 shown in FIG. 11 has properties similar to those of the L picture threading structure 900 shown in FIG. 9. A loss occurring at an S2 picture only affects the particular picture, whereas a loss at an S1 picture will also affect the following S2 picture. In either case, drift will terminate upon decoding of the next S0 picture.
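The containment of losses within threads can be checked with a small simulation. The sketch below assumes, for illustration only, a 4-picture period in which picture n is an L0 picture when n mod 4 is 0 and an L1 picture when n mod 4 is 2, with each picture predicted from a single reference as described for FIG. 9; the function names are hypothetical.

```python
def reference_of(n):
    """Reference picture index for picture n (assumed 4-picture period)."""
    phase = n % 4
    if phase == 0:                       # L0: predicted from the previous L0
        return n - 4 if n >= 4 else None
    if phase == 2:                       # L1: predicted from the previous L0
        return n - 2
    return n - 1                         # L2: most recent L0 or L1

def affected_by_loss(lost, horizon):
    """Pictures in [lost, horizon) that cannot be decoded correctly."""
    damaged = {lost}
    for n in range(lost + 1, horizon):
        if reference_of(n) in damaged:   # drift propagates along references
            damaged.add(n)
    return sorted(damaged)
```

In this model a lost L2 picture damages only itself, a lost L1 picture damages itself and the following L2 picture, and a lost L0 picture damages every subsequent picture up to the horizon, since no intra picture or reliably delivered picture is modeled.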

With renewed reference to FIG. 9, a packet loss occurring at an L0 picture can be catastrophic in terms of picture quality, since all picture types will be affected. As previously noted, a traditional solution to this problem is to periodically code L0 pictures as intra or I pictures. However, the bandwidth overhead for implementing this solution can be considerable as the I pictures are typically 3-6 times larger than P pictures. Furthermore, the packet loss, which gives rise to the need to use an I picture, is often the result of network congestion. Attempting to send an I picture over the network to remedy the packet loss only exacerbates the congestion problem.

If the base layer L0 and some enhancement layer pictures are transmitted in a way that guarantees their delivery, the remaining layers can be transmitted on a best-effort basis without catastrophic results in the case of a packet loss. Such guaranteed transmissions can be performed using known techniques such as DiffServ and FEC. In the description herein, reference also may be made to a High Reliability Channel (HRC) and Low Reliability Channel (LRC) as the two actual or virtual channels that offer such differentiated quality of service (FIG. 1). (See e.g., PCT/US06/028365 and PCT/US06/028366). In video communication systems which use scalable video coded structures (e.g., structure 1100, FIG. 11), layers L0-L2 and S0 may, for example, be reliably transmitted on the HRC, while S1 and S2 are transmitted on the LRC. Although the loss of an S1 or S2 packet would cause limited drift, it is still desirable to be able to conceal as much of the loss of information as possible.

The error resilience techniques described in International patent application No. PCT/US06/061815 overcome the limitations of traditional techniques for compensating for packet loss by utilizing reliable transmission of a subset of the L0 layer or the entire L0 layer. Error resilience or reliability is ensured by retransmissions. These error resilience techniques are designed not merely to recover a lost picture for display purposes, but are designed to create the correct reference picture for the decoding of future pictures that depend on the one that was contained (in whole or in part) in a lost packet. The present invention improves on these techniques by ensuring their proper operation in the case where pictures are transmitted over multiple transport layer (e.g., RTP) packets. In system implementations of the present invention, the reliable transmission of the L0 pictures may be performed by LRP modules (e.g., FIG. 2, modules 270A and 270B, and FIG. 4, modules 470A and 470B) using positive or negative acknowledgments between the sending and receiving counterparts according to a suitable protection protocol.

FIG. 12 shows an exemplary picture coding structure 1200 (which is also described in International patent application No. PCT/US06/061815) in which the L0 base and L1-L2 temporal enhancement layers are coupled with at least one reliably transmitted base layer picture for error resilient video communications. In coding structure 1200, in addition to conventional base and enhancement picture types that are labeled as L0-L2 pictures, there is a new picture type called LR (‘R’ for reliable). It is noted that in coding structure 1200 shown in FIG. 12, the layers LR and L0-L2 could equivalently have been labeled as L0-L3, respectively, since the LR pictures always are the lowest temporal layer of the coded video signal. In accordance with the present invention for error resilient video communications, the LR pictures, which may be P pictures, are designated to be reliably delivered to receiver destinations.

The operation of the inventive error resilient techniques can be understood by consideration of an example in which one of the L0 pictures is damaged or lost due to packet loss. As previously noted, in traditional communication systems the effect of loss of the L0 picture is severe on all subsequent L0-L2 pictures. With the picture coding structure 1200, the next “reliably-delivered” LR picture after a lost L0 picture offers a resynchronization point, after which point the receiver/decoder can continue decoding and display without distortion.

In the coding structure 1200 shown in FIG. 12, the temporal distance between the LR pictures is, for example, 12 frames. The reliable delivery of the LR pictures exploits the fact that P pictures with very long temporal distances (6 frames or more) are about half the size of an I picture, and that the reliable delivery is not intended to ensure timely display of the relevant picture, but instead is intended for creation of a suitable reference picture for future use. As a result, the delivery of an LR picture can be accomplished by a very slight bandwidth increase in the system during the period between successive LR pictures.

Coding structure 1200 may be implemented using the existing H.264 standard under which the LR pictures may, for example, be stored at a decoder as long-term reference pictures and be replaced using MMCO commands.

FIG. 13 shows an exemplary picture coding structure 1300 where the LR picture concept is applied to enhancement layer pictures (either spatial or quality scalability). Here, the pictures to be reliably transmitted are labeled SR, and as with LR pictures, they constitute the lowest temporal layer of the spatial or quality enhancement layer.

It is noted that although the LR pictures concept is generally described herein, for purposes of illustration, as applied to the lowest temporal layer of the coded video signal, the concept can also be extended and applied to additional layers in accordance with the principles of the present invention. This extended application will result in additional pictures being transported in a reliable fashion. For example, with reference to FIG. 12, in addition to the LR pictures, the L0 pictures could also be included in the reliable (re)transmission mechanism. Similarly, pictures of any spatial/quality enhancement layers (from the lowest or additional temporal layers) may be included. Further, video sequence header or other data may be treated or considered to be equivalent to LR pictures in the system so that they (header or other data) are reliably transmitted. In the following, for simplicity in description we assume that only LR pictures are reliably transmitted, unless explicitly specified otherwise. However, it will be readily understood that additional layers or data can be reliably transmitted in exactly the same way.

It is desirable that the bandwidth overhead for the reliable delivery of the LR frames is zero or negligible, when there are no packet losses. This implies that a dynamic, closed-loop algorithm should be used for the reliable delivery mechanism. It may also be possible to use open loop algorithms, where, for example, an LR frame is retransmitted proactively a number of times.

International patent application No. PCT/US06/061815 describes several mechanisms to notify a sender (e.g., SENDER, SVCS1, or SVCS2) that a particular LR picture has been received by an intended receiver, and also techniques for dynamically establishing LR pictures. Using RTCP or other feedback mechanisms, the sender can be notified that a particular receiver is experiencing lost packets using, for example, the positive and negative acknowledgment techniques described therein. The feedback can be as detailed as individual ACK/NACK messages for each individual packet. Use of feedback enables the encoder to calculate (exactly or approximately) the state of the decoder(s), and act accordingly. This feedback is generated and collected by Reliability and Random access Control (RRC) modules 530 (FIG. 6).

An important aspect of these sender-notification mechanisms is the technique by which a receiver (receiving endpoint or SVCS) detects the loss of an LR picture with minimal delay. The technique used in the aforementioned patent application relies on LR picture numbers and picture number references.

The LR picture numbers technique operates by assigning sequential numbers to LR pictures, which are carried together with the LR picture packets. The receiver maintains a list of the numbers of the LR pictures it has received. Non-LR pictures, on the other hand, contain the sequence number of the most recent LR picture in decoding order. This sequence number reference allows a receiver to detect a lost LR picture even before receipt of the following LR picture. When a receiver receives an LR picture, it can detect if it has lost (i.e. not received) one or more of the previous LR pictures by comparing the picture number of the received LR picture with the list of picture numbers it maintains. The picture number of the received LR picture should be one more than that of the previous one, or 0 if the count has restarted. When a receiver receives a non-LR picture, it tests to see if the referenced LR picture number is present in its number list. If it is not, the referenced LR picture is assumed to be lost and corrective action may be initiated (e.g., a NACK message is transmitted back to the sender). It is noted that detection of lost LR pictures using the LR picture number technique can be performed both at a receiving endpoint and at an intermediate SVCS. The operation is performed, e.g., at the LRP (Rcv) module 270B in FIG. 2 and FIG. 3, or 470B in FIG. 4.
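The two checks performed by the receiver can be sketched as follows. The class and method names are hypothetical, and the sketch makes simplifying assumptions that are not part of the description above: LR pictures arrive in order, and a picture number of 0 is always accepted as a restart of the count.

```python
class LRLossDetector:
    """Illustrative receiver-side bookkeeping of LR picture numbers."""

    def __init__(self):
        self.received = set()    # numbers of the LR pictures received so far
        self.last_lr = None      # number of the most recently received LR picture

    def on_lr_picture(self, number):
        """Return True if one or more previous LR pictures appear to be lost."""
        previous = self.last_lr
        self.received.add(number)
        self.last_lr = number
        if previous is None:
            return False
        # Expected: one more than the previous number, or 0 after a restart.
        return number != previous + 1 and number != 0

    def on_non_lr_picture(self, referenced_lr):
        """Return True if the referenced LR picture was never received."""
        return referenced_lr not in self.received
```

A True result from either method corresponds to the point at which corrective action, such as transmitting a NACK message, may be initiated. Note that the second check can fire before the following LR picture arrives.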

A potential limitation of the picture numbers technique can manifest itself when a single LR picture is transported using more than one packet. Such transport may occur, for example, if encoding is done using multiple slices, but can occur whenever the coded bits of a given picture exceed the maximum transport layer packet size. When multiple packets are used to transport a picture, all the packets will have the same picture index value since they belong to the same picture. If all such packets are lost in transit, then the receiver can properly detect the loss upon the next successful reception of picture data. In the case of partial data reception, however, in which only some of the picture's packets are lost (but a few of the packets are received), a receiver will not be able to detect the loss unless it examines the data to determine if all macroblocks contained in the picture are included in the received data. This determination, which requires that the receiver parse coded video data, is a computationally demanding task. In the H.264 or H.264 SVC cases, for example, determining if a set of slices includes data for an entire picture requires parsing of the entire slice header. The parsing operation can be performed in a receiver that is equipped with a decoder. However, such is not the case when the receiver is an SVCS or any other type of MANE.

To address error resilience in the case of partial data reception, it is noted that a receiver can detect packet losses using the sequence number associated with every packet (e.g., RTP sequence numbers in a preferred embodiment where RTP is used as the transport protocol). Successive packets of an LR picture will contain successive RTP sequence numbers. If partial data is received, a receiver knows from the gap in the received RTP sequence numbers that some data was lost, but it cannot determine if the lost data corresponds to a portion of the LR picture or to data from a following picture. As a result, from the RTP sequence numbers alone, it is not possible to detect if the received data contains the entire LR picture. To enable a receiver to detect receipt of the entire picture, the present invention introduces two flags, a start bit flag and an end bit flag, that respectively indicate the first and last packets containing data of an LR picture.

Upon reception of a packet of an LR picture, a receiver can examine its RTP sequence number and check if it has received all previous packets with successively smaller RTP sequence numbers until reaching a packet that has the same picture index value and in which the ‘start’ bit is set. Similarly, it can continue checking that successive packets with successively larger RTP sequence numbers are received, until reaching a packet that has the same picture index value and in which the ‘end’ bit is set. With this modification, frame indices can be used to detect losses of lowest temporal level pictures in both cases when no data is received and when partial data is received.
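The backward and forward walk over RTP sequence numbers can be sketched as follows. The `Pkt` record, its field names, and the dictionary of received packets keyed by sequence number are assumptions made for illustration; only the start and end flags, the picture index, and the 16-bit RTP sequence number come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pkt:
    seq: int        # RTP sequence number (16 bits, wraps around)
    pic_idx: int    # picture index carried with the packet
    start: bool     # set in the first packet of the picture
    end: bool       # set in the last packet of the picture

def _reaches_flag(received, pkt, step, flag):
    """Walk neighbouring sequence numbers until the given flag is found."""
    cur = pkt
    for _ in range(0x10000):                 # bounded by the sequence space
        if getattr(cur, flag):
            return True
        cur = received.get((cur.seq + step) & 0xFFFF)
        if cur is None or cur.pic_idx != pkt.pic_idx:
            return False                     # gap, or a different picture
    return False

def picture_complete(received, pkt):
    """True if every packet of pkt's picture is present in `received`."""
    return (_reaches_flag(received, pkt, -1, "start")
            and _reaches_flag(received, pkt, +1, "end"))
```

Because only packet headers are inspected, a check of this kind does not require parsing coded video data, which is what makes it suitable for a receiver that has no decoder.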

The two flags also may be introduced in temporal levels higher than the lowest temporal level to enable integrity detection for pictures belonging to higher temporal levels. This, coupled with RTP sequence numbers, would allow a receiver to quickly determine if it has received all needed data for a particular picture, regardless of its temporal level.

It is noted that the RTP marker bit has a usual definition for use in video transport as “the last packet of a picture.” Use of the RTP marker bit may be considered in lieu of the ‘end’ flag. However, in the context of SVC, such use of the RTP marker bit is not sufficient to solve the problem this invention addresses, since a picture may include several ‘pictures’ (base and enhancement layers). Furthermore, such a change would create problems in existing RTP systems that already incorporate the usual interpretation of the RTP marker bit.

Two different embodiments of the modified LR picture numbering technique are described herein. One embodiment (hereinafter referred to as the ‘R packets’ technique) is appropriate when the RTP protocol is used by the system for transmission. The other embodiment is applicable when the H.264 SVC draft standard is used for the system.

For the R packets technique, assume that the RTP protocol (over UDP and IP) is used for communication between two terminals, possibly through one or more intermediate servers. Note that the media transmitting terminal may perform real-time encoding, or may access media data from local or other storage (RAM, hard disk, a storage area network, a file server, etc.). Similarly, the receiving terminal may perform real-time decoding, and it may be storing the received data in local or other storage for future playback, or both. For the description herein, it is assumed, without limitation, that real-time encoding and decoding are taking place.

FIG. 14 shows the architecture of the transmitting terminal's LRP Snd module (e.g., module 270A, FIG. 2). The LRP Snd module includes a packet processor (R-Packet Controller 1610) with local storage (e.g., buffer 1605) for packets that may require retransmission. R-Packet Controller 1610 marks the R packets and also responds to RNACKs. The R-Packet Controller is connected to a multiplexer MUX 1620 and a demultiplexer DMUX 1630 implementing the RTP/UDP/IP protocol stack. Although MUX 1620 and DMUX 1630 are shown in FIG. 14 as separate entities, they may be combined in the same unit. MUX 1620 and DMUX 1630 are connected to one or more network interface controllers (NICs) which provide the physical layer interface. In a preferred embodiment, the NIC is an Ethernet adapter, but any other NICs can be used as will be obvious to persons skilled in the art.

Similarly, FIG. 15 shows an exemplary architecture of the receiving terminal's LRP Rcv module (e.g., module 270B, FIG. 2). The R-Packet Controller here (e.g., controller 1610′) is responsible for packet loss detection and generation of appropriate NACK messages. Further, FIG. 16 shows the structure of the server's LRP Snd and Rcv modules (e.g., modules 470A and 470B, FIG. 4), which may be the same as the components of a receiving terminal and those of a transmitting terminal connected back-to-back.

In a preferred embodiment, the transmitting terminal packetizes media data according to the RTP specification. It is noted that although different packetization (called “payload”) formats are defined for RTP, they all share the same common header. This invention introduces a named header extension mechanism (see Singer, D., “A general mechanism for RTP Header Extensions,” draft-ietf-avt-rtp-hdrext-01 (work in progress), February 2006) for RTP packets so that R packets can be properly handled.

According to the present invention, in an RTP session containing R packets, individual packets are marked with the named header extension mechanism. The R packet header extension element identifies both R packets themselves and previously-sent R packets. This header extension element has the name “com.layeredmedia.avt.r-packet/200606”. Every R packet includes, and every non-R packet should include, a header extension element of this form.

FIG. 17 shows an exemplary data field format of the inventive named header extension, in which the fields are defined as follows.

ID: 4 bits

-   -   The local identifier negotiated for this header extension        element, as defined, for example, in Singer, D., “A general        mechanism for RTP Header Extensions,”        draft-ietf-avt-rtp-hdrext-01 (work in progress), February 2006.

Length (len): 4 bits

-   -   The length minus one of the data bytes of this header extension element, not counting the header byte (ID and len). This will have the value 6 if the second word (the superseded range) is present, and 2 if it is not. Thus, its value must either be 2 or 6.

R: 1 bit

-   -   A bit indicating that the packet containing this header        extension element is an R packet in series SER with R sequence        number RSEQ. If this bit is not set, the header extension        element instead indicates that the media stream's most recent R        packet in series SER had R sequence number RSEQ. If this bit is        not set, the superseded range should not be present (i.e. the        len field should be 2) and must be ignored if present.

Reserved, Must Be Zero (0): 1 bit

-   -   Reserved bit. This must be set to zero on transmit and ignored        on receive.

Start (S): 1 bit

-   -   This must be set to one if this is the first packet containing data from a given picture.

End (E): 1 bit

-   -   This must be set to one if this is the last packet containing        data from a given picture.

Series ID (SER): 4 bits

-   -   An identifier of which series of R packets is being described by this header extension element. If a media encoder is describing only a single series of R packets, this should have the value 0. For example, using the scalable video picture coding structure depicted in FIG. 13, L packets (base spatial layer, all threads) would have SER set to, say, 0, and S packets (spatial enhancement layer, all threads) would have SER set to 1.

R Packet Sequence Number (RSEQ): 16 bits

-   -   An unsigned sequence number indicating the number of this R packet within the series SER. This value is incremented by 1 (modulo 2^16) for every R packet sent in a given series. RSEQ values for separate series are independent.

Start of Superseded Range (SUPERSEDE_START): 16 bits

-   -   The R sequence number of the earliest R packet, inclusive, superseded by this R packet, calculated modulo 2^16. (Since this value uses modulo arithmetic, the value RSEQ+1 may be used for SUPERSEDE_START to indicate that all R packets prior to the end of the superseded range have been superseded.) This field is optional, and is only present when len=6.

End of Superseded Range (SUPERSEDE_END): 16 bits

-   -   The R sequence number of the final R packet, inclusive, superseded by this R packet, calculated modulo 2^16. This value must lie in the closed range [SUPERSEDE_START . . . RSEQ] modulo 2^16. This field is optional, and is only present when len=6.
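The field definitions above can be illustrated by a minimal packing and parsing sketch. FIG. 17 is not reproduced here, so the bit positions are an assumption: the fields are taken to appear in the order listed, most significant bit first, in network byte order. The function names are hypothetical.

```python
import struct

def pack_r_element(ext_id, r, s, e, ser, rseq, superseded=None):
    """Pack one R packet header extension element (assumed bit layout)."""
    if superseded is not None and not r:
        raise ValueError("the superseded range requires the R bit to be set")
    length = 6 if superseded is not None else 2     # data bytes minus one
    # Flags byte: R, reserved (must be zero), S, E, then the 4-bit SER.
    flags = (r << 7) | (s << 5) | (e << 4) | (ser & 0x0F)
    out = struct.pack("!BBH", ((ext_id & 0x0F) << 4) | length,
                      flags, rseq & 0xFFFF)
    if superseded is not None:
        start, end = superseded
        out += struct.pack("!HH", start & 0xFFFF, end & 0xFFFF)
    return out

def unpack_r_element(data):
    """Parse an element produced by pack_r_element."""
    head, flags, rseq = struct.unpack("!BBH", data[:4])
    length = head & 0x0F
    if length not in (2, 6):
        raise ValueError("len must be 2 or 6")
    fields = {"id": head >> 4, "r": bool(flags & 0x80),
              "s": bool(flags & 0x20), "e": bool(flags & 0x10),
              "ser": flags & 0x0F, "rseq": rseq, "superseded": None}
    if length == 6 and fields["r"]:     # range is ignored if R is not set
        fields["superseded"] = struct.unpack("!HH", data[4:8])
    return fields
```

Under this assumed layout the element occupies 4 bytes without the superseded range and 8 bytes with it, which agrees with the len values of 2 and 6 stated above (one header byte plus len+1 data bytes).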

The operation of an error resilient video communication system in accordance with the present invention is the same as or similar to the operation described in International patent application No. PCT/US06/61815, except for the use of the ‘S’ and ‘E’ flags. These flags are used at the receiver in combination with RTP sequence numbers to detect if an LR picture has been received in its entirety (in which case no corrective action is needed) or partially (in which case corrective action must be initiated). All other aspects of the system's operation including the various retransmission techniques (e.g., positive or negative acknowledgments) remain the same.

An RTP packet may contain multiple R packet mark elements, so long as each of these elements has a different value for SER. However, an RTP packet must not contain more than one of these header extension elements with the R bit set, i.e. an R packet may not belong to more than one series.

All RTP packets in a media stream using R packets should include a mark element for all active series.

When the second word of this header extension element is present, it indicates that this R packet supersedes some previously-received R packets, meaning that these packets are no longer necessary in order to reconstruct stream state. This second word must only appear in a header extension element which has its R bit set.

An R packet can only supersede R packets in the series identified by the element's SER field. R packets cannot supersede packets in other series.

It is valid for a superseded element to have SUPERSEDE_END=RSEQ. This indicates that the R packet supersedes itself, i.e., that this R packet immediately becomes irrelevant to the stream state. In practice, the most common reason to do this would be to end a series; this can be done by sending an empty packet (e.g. an RTP No-op packet, see Andreasen, F., “A No-Op Payload Format for RTP,” draft-ietf-avt-rtp-no-op-00 (work in progress), May 2005) with the superseded range (SUPERSEDE_START, SUPERSEDE_END)=(RSEQ+1, RSEQ), so that the series no longer contains any non-superseded packets.

The first R packet sent in a series should be sent with the superseded range (SUPERSEDE_START, SUPERSEDE_END)=(RSEQ+1, RSEQ−1), to make it clear that no other R packets are present in the range.

R packets may redundantly include already-superseded packets in the range of packets to be superseded.
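The modulo-2^16 interpretation of the superseded range can be made concrete with a short membership test. The function name is hypothetical, and the sketch assumes the range is read as the closed interval [SUPERSEDE_START . . . SUPERSEDE_END] traversed in increasing order with wrap-around.

```python
MOD = 1 << 16   # R sequence numbers are 16-bit values

def is_superseded(x, start, end):
    """True if R sequence number x lies in [start..end] modulo 2^16."""
    return (x - start) % MOD <= (end - start) % MOD
```

With this reading, the range (RSEQ+1, RSEQ) covers all 2^16 sequence numbers, including RSEQ itself, which is the series-ending case described above. The range (RSEQ+1, RSEQ−1) covers every sequence number except RSEQ, which is the case recommended for the first R packet of a series.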

The loss of R packets is detected by the receiver, and is indicated by the receiver to the sender using an RTCP feedback message. The R Packet Negative Acknowledgement (RNACK) Message is an RTCP Feedback message (see e.g., Ott, J. et al., “Extended RTP Profile for RTCP-based Feedback (RTP/AVPF),” RFC 4585, July 2006) identified, as an example, by PT=RTPFB and FMT=4. Other values can be chosen, in accordance with the present invention. The FCI field must contain at least one and may contain more than one RNACK.

The RNACK packet is used to indicate the loss of one or more R packets. The lost packet(s) are identified by means of a packet sequence number, the series identifier, and a bit mask.

The structure and semantics of the RNACK message are similar to those of the AVPF Generic NACK message.

FIG. 18 shows the exemplary syntax of the RNACK Feedback Control Information (FCI) field in which individual fields are defined as follows:

R Packet Sequence Number (RSEQ): 16 bits

-   -   The RSEQ field indicates a RSEQ value that the receiver has not        received.

Series ID (SER): 4 bits

-   -   An identifier of which sequence of R packets is being described        as being lost by this header extension element.

Bitmask of following Lost R Packets (BLR): 12 bits

-   -   The BLR allows for reporting losses of any of the 12 R packets immediately following the R packet indicated by RSEQ. Denoting the BLR's least significant bit as bit 1, and its most significant bit as bit 12, bit i of the bit mask is set to 1 if the receiver has not received R packet number (RSEQ+i) in the series SER (modulo 2^16), indicating that this packet is lost; bit i is set to 0 otherwise. Note that the sender must not assume that a receiver has received an R packet because its bit in the mask was set to 0. For example, the least significant bit of the BLR would be set to 1 if the packet corresponding to RSEQ and the following R packet in the sequence had been lost. However, the sender cannot infer that packets RSEQ+2 through RSEQ+12 have been received simply because bits 2 through 12 of the BLR are 0; all the sender knows is that the receiver has not reported them as lost at this time.
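By way of illustration only, the field definitions above can be exercised with the following sketch, which packs and parses one 32-bit RNACK entry. The bit ordering (RSEQ in the 16 most significant bits, followed by the 4-bit SER and the 12-bit BLR, in network byte order) is an assumption made for this example, since FIG. 18 is not reproduced here; the function names are likewise hypothetical.

```python
import struct

RSEQ_MOD = 1 << 16  # RSEQ arithmetic is modulo 2^16


def build_rnack_fci(rseq: int, ser: int, lost_offsets) -> bytes:
    """Pack one RNACK entry.

    rseq         -- RSEQ of a packet the receiver has not received
    ser          -- 4-bit series identifier
    lost_offsets -- offsets i in 1..12 such that R packet RSEQ+i is also lost
    """
    if not 0 <= ser < 16:
        raise ValueError("SER must fit in 4 bits")
    blr = 0
    for i in lost_offsets:
        if not 1 <= i <= 12:
            raise ValueError("BLR covers only RSEQ+1 through RSEQ+12")
        blr |= 1 << (i - 1)  # bit 1 is the least significant bit
    word = ((rseq % RSEQ_MOD) << 16) | (ser << 12) | blr
    return struct.pack("!I", word)  # assumed layout: RSEQ | SER | BLR


def parse_rnack_fci(data: bytes):
    """Return (ser, sequence numbers reported as lost) for one entry."""
    (word,) = struct.unpack("!I", data[:4])
    rseq = word >> 16
    ser = (word >> 12) & 0xF
    blr = word & 0xFFF
    lost = [rseq]
    for i in range(1, 13):
        if blr & (1 << (i - 1)):
            lost.append((rseq + i) % RSEQ_MOD)
    return ser, lost


fci = build_rnack_fci(65535, 3, [1, 12])
ser, lost = parse_rnack_fci(fci)
```

Consistent with the note above, a sender using such a parser would treat the returned list only as the packets reported lost so far, not as evidence that the remaining packets were received.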

The structure of the RNACK message shown in FIG. 18 is identical to the one described in International patent application No. PCT/US06/061815.

The second exemplary detection technique, which allows a receiver to detect that an LR picture (including SR pictures) has been lost with a minimal delay, is applicable to the systems based on the H.264 SVC draft standard. In such case H.264 SVC NAL units are used as the basis for transmission. International patent application No. PCT/US06/61815 describes how the LR picture index technique may be applied in this case as well. As with the RTP embodiment, the present invention introduces two single-bit flags to address the case where multiple packets are used for transport of a given LR picture.

FIG. 19 shows the structure of the inventive H.264 SVC NAL header extension modified to include the start and end flags, using as the basis the syntax of the H.264 SVC draft (see e.g., T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, M. Wien, eds., “Joint Scalable Video Model 8: Joint Draft 8 with proposed changes,” Joint Video Team, Doc. JVT-U202, Hangzhou, October 2006, incorporated herein by reference in its entirety). The start and end flags are the pic_start_flag and pic_end_flag, whereas the picture index is the t10_pic_idx parameter. The dependency_id (D), temporal_level (T), and quality_level (Q) fields indicate points in the spatial/coarse grain quality, temporal, and fine-grain quality dimensions respectively. In other words, they indicate the position of the NAL's payload in the set of resolutions provided by the scalable encoder. It is noted that the base layer in this scheme is identified by D=Q=T=0.
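As a non-limiting sketch of how a receiver might use these fields, the following example tracks, for a single spatial/quality layer, which lowest temporal layer (T=0) pictures have had both their start and end packets observed, and flags a packet of a higher temporal layer whose referenced picture index has not been completely received. The class and attribute names are hypothetical, the bit layout of FIG. 19 is not modeled, and the sketch assumes in-order arrival; it does not detect the loss of a middle fragment, which would additionally require the transport sequence numbers.

```python
from dataclasses import dataclass


@dataclass
class PacketInfo:
    """Fields of interest from one received NAL unit (hypothetical container)."""
    temporal_level: int   # T; 0 denotes the lowest temporal layer
    pic_idx: int          # own index if T == 0, else index of the most recent T == 0 picture
    pic_start_flag: bool  # packet carries the first portion of the picture
    pic_end_flag: bool    # packet carries the last portion of the picture


class LrTracker:
    """Tracks which T == 0 pictures have had their start and end observed."""

    def __init__(self):
        self.seen = {}  # pic_idx -> [start_seen, end_seen]

    def on_packet(self, p: PacketInfo) -> bool:
        """Return True if p reveals a missing or incomplete T == 0 picture."""
        if p.temporal_level == 0:
            state = self.seen.setdefault(p.pic_idx, [False, False])
            state[0] = state[0] or p.pic_start_flag
            state[1] = state[1] or p.pic_end_flag
            return False
        # Higher temporal layer: check the referenced T == 0 picture.
        state = self.seen.get(p.pic_idx)
        return state is None or not all(state)


tracker = LrTracker()
results = [
    tracker.on_packet(PacketInfo(0, 5, True, False)),  # first fragment of picture 5
    tracker.on_packet(PacketInfo(1, 5, True, True)),   # end of picture 5 not yet seen
    tracker.on_packet(PacketInfo(0, 5, False, True)),  # last fragment of picture 5
    tracker.on_packet(PacketInfo(1, 5, True, True)),   # picture 5 now complete
    tracker.on_packet(PacketInfo(2, 6, True, True)),   # picture 6 never arrived
]
```

In a complete system, a True result would be the point at which the receiver issues a negative acknowledgment for the referenced picture.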

While there have been described what are believed to be the preferred embodiments of the present invention, those skilled in the art will recognize that other and further changes and modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as fall within the true scope of the invention. For example, alternative mechanisms for indicating the LR picture frame index value and referring to it in non-LR pictures may be used in accordance with the present invention both within an RTP transmission context and an H.264 SVC NAL transmission context. Similarly, alternative mechanisms for indicating the start and end flags may be used in both RTP and H.264 SVC. For example, the t10_pic_idx parameter and associated pic_start_flag and pic_end_flag parameters may be carried in an SEI message.

It also will be understood that the systems and methods of the present invention can be implemented using any suitable combination of hardware and software. The software (i.e., instructions) for implementing and operating the aforementioned systems and methods can be provided on computer-readable media, which can include, without limitation, firmware, memory, storage devices, microcontrollers, microprocessors, integrated circuits, ASICs, on-line downloadable media, and other available media.

What is claimed is:
 1. A system for media communications between a transmitting endpoint or server and one or more receiving endpoint(s) or server(s) over a packet-based communication network, the system comprising: an encoder configured to use a temporal coding structure having a number of different layers including a lowest temporal layer, wherein each picture is associated with a picture index number; wherein data corresponding to a single picture is portioned and transmitted in one or more individual data packets, wherein an individual data packet comprises data elements that indicate: for the lowest temporal layer pictures, a sequence number identifying said pictures, for other temporal layer pictures, a reference to the sequence number of the most recent, in decoding order, lowest temporal layer picture, and for all pictures, a ‘start’ flag and an ‘end’ flag that respectively indicate if the individual data packet contains the first or last data portions of the picture.
 2. The system of claim 1 wherein the data elements additionally indicate a series number associated with each spatial or quality layer, wherein the receiving endpoint or server detects if a lowest temporal layer picture of a particular spatial or quality layer is lost by determining if the picture corresponding to the referenced series number and sequence number has been received at the receiving endpoint or server.
 3. The system of claim 2 wherein the communication network uses the Internet Protocol, media transport is performed using real-time transport protocol (RTP), and the data elements include data to indicate presence of a lowest temporal layer frame or fragment thereof in the packet.
 4. The system of claim 3 wherein a receiving endpoint or server sends a negative acknowledgment message to the transmitting endpoint or server in response to the receiving endpoint's or server's detection of a lost lowest temporal layer picture or portion of such picture.
 5. The system of claim 4 wherein the transmitting endpoint or server upon receiving the negative acknowledgment message retransmits the lost lowest temporal layer picture or portion of such picture.
 6. The system of claim 1, wherein the encoder conforms to H.264 Scalable Video Coding (SVC), and the data elements are carried in Network Adaptation Layer (NAL) unit header extension for SVC elements.
 7. The system of claim 1, wherein the encoder conforms to H.264 SVC, and the data elements are carried in a Supplemental Enhancement Information (SEI) message.
 8. A method for media communications between a transmitting endpoint or server and one or more receiving endpoint(s) or server(s) over a packet-based communication network, wherein an encoder is configured to use a temporal coding structure having a number of different layers including a lowest temporal layer, and wherein data corresponding to a single picture is portioned and transmitted in one or more individual data packets, the method comprising: placing in each individual data packet data elements that indicate: for the lowest temporal layer pictures, a sequence or index number identifying said pictures, for other temporal layer pictures, a reference to the sequence number of the most recent, in decoding order, lowest temporal layer picture, and for all pictures, a ‘start’ flag and an ‘end’ flag that indicate if the individual data packet contains the first or last, respectively, data portions of the picture.
 9. The method of claim 8, wherein the data elements additionally indicate a series number associated with each spatial or quality layer, so that the receiving endpoint or server detects if a lowest temporal layer picture of a particular spatial or quality layer is lost by determining if the picture corresponding to the referenced series number and sequence number has been received at the receiving endpoint or server.
 10. The method of claim 8, wherein the communication network uses the Internet Protocol, media transport is performed using real-time transport protocol (RTP), and the data elements include data to indicate presence of a lowest temporal layer picture or fragment thereof in the packet.
 11. The method of claim 10, wherein a receiving endpoint or server sends a negative acknowledgment message to the transmitting endpoint or server in response to the receiving endpoint's or server's detection of a lost lowest temporal layer picture or portion of such picture.
 12. The method of claim 11, wherein the transmitting endpoint or server upon receiving the negative acknowledgment message retransmits the lost lowest temporal layer picture or portion of such picture.
 13. The method of claim 12, wherein the encoder conforms to H.264 Scalable Video Coding (SVC), and the data elements are carried in Network Adaptation Layer (NAL) unit header extension for SVC elements.
 14. The method of claim 8, wherein the encoder conforms to H.264 SVC, and the data elements are carried in a Supplemental Enhancement Information (SEI) message.
 15. A non-transitory computer readable medium comprising a set of instructions to direct a processor to perform the method recited in claim 8.
 16. A non-transitory computer readable medium comprising a set of instructions to direct a processor to perform the method recited in claim 9.
 17. A non-transitory computer readable medium comprising a set of instructions to direct a processor to perform the method recited in claim 10.
 18. A non-transitory computer readable medium comprising a set of instructions to direct a processor to perform the method recited in claim 11.
 19. A non-transitory computer readable medium comprising a set of instructions to direct a processor to perform the method recited in claim 12.
 20. A non-transitory computer readable medium comprising a set of instructions to direct a processor to perform the method recited in claim 14.